Improved TF-IDF Keyword Extraction Algorithm (改进的TF-IDF关键词提取方法)
Related resources
The DF-ICF Algorithm- Modified TF-IDF
The tf-idf is an algorithm which is generally used where massive data processing is done. Tf-idf is the weight given to a particular term within a document and it is proportional to the importance of the term. This paper aims to use the idea behind the tf-idf algorithm to design the df-icf algorithm which finds the importance of a particular document within the given corpus. General Terms DF-IC...
Exploiting Lexical Dependencies from Large-Scale Data for Better Shift-Reduce Constituency Parsing
This paper proposes a method to improve shift-reduce constituency parsing by using lexical dependencies. The lexical dependency information is obtained from a large amount of auto-parsed data that is generated by a baseline shift-reduce parser on unlabeled data. We then incorporate a set of novel features defined on this information into the shift-reduce parsing model. The features can help to ...
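A Parallel Pages Mining Approach: Combining URL Patterns and HTML Structures
Liu Qi (刘奇), Liu Yang (刘洋), Sun Maosong (孙茂松) (State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing 100084) Abstract: Parallel corpora are a foundational data resource that underpins applications such as machine translation and cross-language information retrieval. Although the number of parallel web pages on the Internet is huge and still growing, the heterogeneity and complexity of parallel websites make it a major challenge to acquire high-quality parallel pages quickly and automatically and to build parallel corpora from them. This paper proposes a parallel page acquisition method that combines URL patterns with HTML structures: it first uses the HTML structure to visit parallel pages recursively, then uses URL patterns to optimize the topological order in which a parallel website is traversed, yielding efficient and accurate acquisition. Experiments on two parallel websites, those of the United Nations and the Hong Kong government, show that compared with traditional acquisition methods our method cuts acquisition time by more than 50%, raises accuracy by 15%, and significantly improves machine translation quality (BLEU rises by 1.6 and 0.7 percentage points, respectively). Keywords: parallel page acquisition; parallel corpus; URL...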
Clustering scRNA-Seq Data using TF-IDF
In this abstract, we propose several computational approaches for clustering scRNA-Seq data based on the Term Frequency Inverse Document Frequency (TF-IDF) transformation that has been successfully used in the field of text analysis. Empirical evaluation on simulated cell mixtures with different levels of complexity suggests that the TF-IDF methods consistently outperform existing scRNA-Seq clu...
Deriving TF-IDF as a Fisher Kernel
The Dirichlet compound multinomial (DCM) distribution has recently been shown to be a good model for documents because it captures the phenomenon of word burstiness, unlike standard models such as the multinomial distribution. This paper investigates the DCM Fisher kernel, a function for comparing documents derived from the DCM. We show that the DCM Fisher kernel has components that are similar...
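Several of the abstracts listed above build on the standard TF-IDF weighting, in which a term's weight in a document grows with its frequency there and shrinks with the number of documents that contain it. The following is a minimal sketch of that textbook weighting only, not the improved algorithm of the indexed article or the method of any listed paper. It assumes documents are already tokenized, and the function names `tf_idf` and `top_keywords` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. Weight = (term count / document length)
    * log(number of documents / number of documents containing the term).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def top_keywords(doc_weights, k=3):
    """Pick the k highest-weighted terms, breaking ties alphabetically."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

With this unsmoothed form, a term that occurs in every document gets weight zero, which is why common words drop out of the keyword ranking. Practical implementations often smooth the logarithm or normalize the resulting vectors.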
Journal
Journal title: Computer Science and Application
Year: 2013
ISSN: 2161-8801, 2161-881X
DOI: 10.12677/csa.2013.31012